[DE-7859] Expose pHash on DatasetItem (v0.18.3) #461
Open
vinay553 wants to merge 2 commits into
Add a `phash` field to the `DatasetItem` dataclass and thread it through `from_json`. Every SDK method that returns a `DatasetItem` (`items_and_annotation_generator`, `items_generator`, `query_items`, `dataset.items`, `iloc`/`refloc`/`loc`) deserializes through `DatasetItem.from_json`, so exposing the field there is sufficient; no per-method changes are required.

Also adds a top-level `CLAUDE.md` with release/branch conventions and architecture pointers for future Claude Code sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Expose the perceptual-hash (pHash) of dataset items through the SDK so ML workflows (dedup, near-duplicate detection) can access it without a separate fetch.
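As a rough illustration of the dedup use case, the sketch below compares pHash values by Hamming distance. It assumes only what this PR states: `phash` is a 64-character string of `0`/`1` characters, or `None` when not backfilled. The helper names (`hamming_distance`, `near_duplicates`) and the 4-bit threshold are hypothetical choices for illustration, not part of the SDK.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple


def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length '0'/'1' strings differ."""
    if len(a) != len(b):
        raise ValueError("pHash strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def near_duplicates(
    phashes: Dict[str, Optional[str]], max_distance: int = 4
) -> List[Tuple[str, str, int]]:
    """Return (ref_a, ref_b, distance) for every pair within max_distance bits.

    Entries whose pHash is None (not yet backfilled) are skipped.
    The threshold of 4 bits is an arbitrary illustrative default.
    """
    hashed = {ref: h for ref, h in phashes.items() if h is not None}
    pairs = []
    # Brute-force pairwise comparison: O(n^2), fine for small item sets.
    for (ref_a, h_a), (ref_b, h_b) in combinations(sorted(hashed.items()), 2):
        distance = hamming_distance(h_a, h_b)
        if distance <= max_distance:
            pairs.append((ref_a, ref_b, distance))
    return pairs
```

In practice the input mapping could be built from any SDK call that yields items, for example `{item.reference_id: item.phash for item in items}`. For large datasets the quadratic comparison would likely need to be replaced with an index-based approach.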
- Adds a `phash: Optional[str]` field to the `DatasetItem` dataclass: a 64-character "0/1" binary string when backfilled by the backend, `None` otherwise.
- Threads `phash=payload.get(PHASH_KEY)` into `DatasetItem.from_json`. Because every SDK method that returns a `DatasetItem` goes through `from_json`, this single change exposes `item.phash` on:
  - `items_and_annotation_generator`
  - `items_generator` / `dataset.items`
  - `query_items`
  - `iloc` / `refloc` / `loc`
- Bumps the version `0.18.2 → 0.18.3` and adds a CHANGELOG entry per the project's Keep-a-Changelog convention.
- Adds `CLAUDE.md` capturing the release workflow, branch/PR conventions, and the `from_json`-centralization insight for future agent sessions.

Test plan
- `poetry install && poetry run python -c "from nucleus.dataset_item import DatasetItem; print(DatasetItem.from_json({'reference_id':'r', 'image_url':'x.jpg', 'phash':'1'*64}).phash)"` prints the hash.
- `DatasetItem.from_json` falls back to `None` when the backend omits `phash` (existing test fixtures).
- `client.get_dataset(...).items_and_annotation_generator(...)` yields items with `item.phash` populated.

🤖 Generated with Claude Code
Greptile Summary
This PR exposes the backend-computed perceptual hash (`phash`) on `DatasetItem` by adding a new `Optional[str]` field and threading it through the single `from_json` deserialization entry point, making it available across all SDK methods that return items.

- Adds the `PHASH_KEY = "phash"` constant and a `phash: Optional[str] = None` field to the `DatasetItem` dataclass; `phash` is intentionally omitted from `to_payload` since it is read-only and computed by the Nucleus backend.
- Bumps the version to `0.18.3` and prepends a matching CHANGELOG entry following the project's Keep-a-Changelog convention.
- Adds `CLAUDE.md` to document the repo's release workflow and architecture for future AI-assisted sessions.

Confidence Score: 5/5

This is a purely additive, backwards-compatible change: a new optional field defaulting to `None`, with no effect on existing serialization or upload paths.

The change is minimal and surgical: one constant, one dataclass field, and one `payload.get` call in `from_json`. The field defaults to `None`, so all existing callers and test fixtures continue to work unchanged. `to_payload` is correctly left untouched since `phash` is backend-computed and should not be round-tripped in uploads.

No files require special attention.
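The pattern described above can be sketched with a small stand-in dataclass. This is not the SDK's `DatasetItem`: the class name `ItemSketch` and its reduced field set are hypothetical, and only `PHASH_KEY`, the `Optional[str] = None` default, the `payload.get` lookup, and the omission from `to_payload` are taken from this PR.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

PHASH_KEY = "phash"  # constant name and value as described in this PR


@dataclass
class ItemSketch:
    """Hypothetical stand-in for DatasetItem with a reduced field set."""

    reference_id: str
    image_url: Optional[str] = None
    phash: Optional[str] = None  # read-only, computed by the backend

    @classmethod
    def from_json(cls, payload: Dict[str, Any]) -> "ItemSketch":
        # dict.get returns None when the backend omits the key, so
        # payloads without a pHash keep deserializing unchanged.
        return cls(
            reference_id=payload["reference_id"],
            image_url=payload.get("image_url"),
            phash=payload.get(PHASH_KEY),
        )

    def to_payload(self) -> Dict[str, Any]:
        # phash is deliberately left out: it is not round-tripped on upload.
        return {"reference_id": self.reference_id, "image_url": self.image_url}
```

Because the field has a default and is read with `dict.get`, both old payloads (no `phash` key) and new ones deserialize through the same code path.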
Important Files Changed
- Adds the `phash: Optional[str] = None` field to the dataclass and threads it through `from_json`; `to_payload` intentionally omits it since phash is backend-computed and read-only.
- Adds the `PHASH_KEY = "phash"` constant in alphabetical order, following the existing naming convention.
- … `phash` field.

Sequence Diagram
    sequenceDiagram
        participant C as Caller (user code)
        participant SDK as SDK method<br/>(items_generator / query_items / iloc / etc.)
        participant FJ as DatasetItem.from_json
        participant API as Nucleus REST API
        C->>SDK: call SDK method
        SDK->>API: GET /v1/nucleus/...
        API-->>SDK: JSON payload (includes "phash" when backfilled)
        SDK->>FJ: from_json(payload)
        FJ->>FJ: payload.get(PHASH_KEY) → phash or None
        FJ-->>SDK: "DatasetItem(phash="0101...", ...)"
        SDK-->>C: DatasetItem with .phash populated

Reviews (2): Last reviewed commit: "Tighten phash field comment"